Abstract

We propose a randomized algorithm for training Support Vector Machines (SVMs)
on large datasets. Using ideas from random projections, we show that the
combinatorial dimension of SVMs is $O(\log n)$ with high probability. This estimate
of the combinatorial dimension is used to derive an iterative algorithm, called
RandSVM, which at each step calls an existing solver to train an SVM on a randomly
chosen subset of size $O(\log n)$. The algorithm has probabilistic guarantees and is
capable of training SVMs with kernels for both classification and regression problems.
Experiments on synthetic and real-life data sets demonstrate that the algorithm scales
up existing SVM learners without loss of accuracy.
1 Introduction

Consider a training data set $D = \{(x_i, y_i),\ i = 1, \ldots, n\}$ where $x_i \in \mathbb{R}^d$ are data points
and $y_i$ are labels. The problem of learning a linear classifier, $y = \mathrm{sign}(w^\top x + b)$, where
$y \in \{1, -1\}$, or a linear function $y = w^\top x + b$ when $y$ is a scalar, can be understood
as estimating $\{w, b\}$ from $D$. Over the years Support Vector Machines (SVMs) have
emerged as powerful tools for estimating such functions. In this paper we concentrate on
developing randomized algorithms for learning SVMs on large datasets. For a detailed
review of SVM classification and SVM regression, see [18].

To develop notation we briefly discuss the problem of training linear classifiers. The
SVM formulation for linearly separable datasets is given by [18]

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\quad i = 1, \ldots, n$$

where $\|w\| = \sqrt{w^\top w}$ is the Euclidean norm of $w$. The formulation has very interesting
geometric underpinnings [5]. It can be understood as computing the distance between the
convex hulls of the sets $\{x_i \mid y_i = 1\}$ and $\{x_j \mid y_j = -1\}$. For linearly non-separable
datasets the following formulation
C-SVM-1:

$$\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\quad i = 1, \ldots, n$$

which will be called C-SVM, again due to [18], can be used. This formulation does not
have an elegant geometric interpretation like the separable case, but one can consider
C-SVMs as computing the distance between two reduced convex hulls [5].
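
To make the role of the slack variables concrete, the following minimal Python sketch evaluates a soft-margin objective of the standard C-SVM form at a given candidate $(w, b)$. It assumes the objective $\tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i$ with the smallest feasible slacks $\xi_i = \max(0, 1 - y_i(w^\top x_i + b))$; the function name `csvm_objective` and the toy data are illustrative and do not come from the paper.

```python
def csvm_objective(w, b, data, C):
    """Return (objective, slacks) of a C-SVM style problem at a fixed (w, b).

    data is a list of (x, y) pairs with x a list of floats and y in {+1, -1}.
    For fixed (w, b), the smallest feasible slack of example i is
    xi_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]
    objective = 0.5 * dot(w, w) + C * sum(slacks)
    return objective, slacks


# Toy data (hypothetical): the third point sits on the wrong side of the margin.
data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.5, 0.0], -1)]
objective, slacks = csvm_objective([1.0, 0.0], 0.0, data, C=10.0)
print(objective, slacks)
```

Only the third example receives a positive slack here, which is the situation the non-separable formulation is designed to absorb.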

Both the formulations are instances of Abstract Optimization Problem(AOP) jHO 
lllj . An AOP is defined as follows: 

Definition 1 (AOP) An AOP is a triple $(H, <, \Phi)$ where $H$ is a finite set, $<$ a
total ordering on $2^H$, and $\Phi$ an oracle that, for a given $F \subseteq G \subseteq H$, either reports
$F = \min_<\{F' \mid F' \subseteq G\}$ or returns a set $F' \subseteq G$ with $F' < F$.

Every AOP has a combinatorial dimension associated with it; the combinatorial
dimension captures the notion of the number of free variables for that AOP. An AOP can
be solved by a randomized algorithm by selecting subsets of size greater than the
combinatorial dimension of the problem [11]. We wish to exploit this property of AOPs to
design randomized algorithms for SVMs.

The idea is to develop an iterative algorithm where in each step one needs to solve
an SVM formulation on a small subset of the training data. Crucial to this idea is the
size of the subset, which is tied to the combinatorial dimension of the SVM formulation.
To this end note that at optimality $w$ is given by

$$w = \sum_{i:\alpha_i>0} \alpha_i y_i x_i, \qquad (1)$$

for both the separable and non-separable case. Using the $\alpha$ variables one can define
the set of support vectors (SVs),

$$S = \{x_i \mid \alpha_i > 0\} \qquad (2)$$

which defines $w$. The set $S$ may not be unique, though $w$ is. The combinatorial
dimension of SVMs is given by the minimum number of SVs required to define $w$. More
formally

$$\Delta = \min_S |S| \qquad (3)$$

where $|S|$ is the cardinality of the set $S$.

The parameter $\Delta$ does not change with the number of examples $n$, and is often much
less than $n$. A priori the value of $\Delta$ is not known, but for linearly separable classification
problems the following holds: $2 \le \Delta \le d + 1$. This follows from the observation that the
SVM computes the distance between 2 non-overlapping convex hulls [5]. When the problem
is not linearly separable, the reduced convex hull interpretation leads to a very crude
upper bound, which is much larger than $d$.

The idea of iterating over randomly sampled subsets of size greater than $\Delta$ for
training SVMs was first explored by [313], and the resulting algorithm was called
RandSVM. The RandSVM procedure iterates over subsets of size proportional to $\Delta^2$,
as shown in Algorithm 1. However, as the authors noted, RandSVM is not practical
for the following reasons. For linear classifiers the sample size is too large in
the case of high dimensional data sets. For non-linear SVMs [18] the dimension of the feature
space is usually unknown when using kernels. Even in this case one can obtain a very
crude upper bound on $\Delta$ by the reduced convex hull approach, but it is not really useful
as the number obtained is very large.



Algorithm 1 RandSVM($D$, $\Delta$)

Require: $D$ - Dataset
Require: $\Delta$ - Combinatorial dimension
1: Sample size $r = 6\Delta^2$
2: Set weights $w(x_i)$ to be 1 for all examples in $D$. For any set $A \subseteq D$, let $w(A) = \sum_{x \in A} w(x)$.
3: repeat
4:   Select a sample $S$ of size $r$ randomly according to $w$.
5:   Use an SVM solver to solve the smaller problem. Let the classifier obtained be $C$.
6:   Classify the non-sampled documents $D - S$.
7:   Let $V$ be the set of misclassified documents and let $v$ be the size of $V$.
8:   if ($w(V) \le w(D)/(3\Delta)$) then
9:     Double the weights of misclassified documents.
10:  end if
11: until $V = \emptyset$
12: Done
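
The reweighting loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the SVM solver is replaced by a hypothetical pair of callables `solve` and `violates`, the weighted draw is done with replacement and then de-duplicated, and the demonstration uses a stand-in problem (smallest enclosing interval of points on a line, whose combinatorial dimension is 2) instead of an SVM.

```python
import random


def rand_reweight(points, delta, solve, violates, rng, max_iters=10_000):
    """Iterative reweighting loop in the style of Algorithm 1.

    solve(sample) returns a model; violates(model, p) reports whether point p
    violates that model. Both are hypothetical stand-ins for an SVM solver.
    Points must be hashable and distinct.
    """
    r = 6 * delta * delta                      # sample size r = 6 * Delta^2
    weights = {p: 1 for p in points}           # w(x_i) = 1 initially
    for _ in range(max_iters):
        pop = list(weights)
        # Weighted draw with replacement, then de-duplicated (a simplification).
        sample = set(rng.choices(pop, weights=[weights[p] for p in pop], k=r))
        model = solve(sample)
        V = [p for p in pop if p not in sample and violates(model, p)]
        if not V:                              # until V is empty
            return model
        w_V = sum(weights[p] for p in V)
        w_D = sum(weights.values())
        if 3 * delta * w_V <= w_D:             # w(V) <= w(D) / (3 * Delta)
            for p in V:
                weights[p] *= 2                # double the weights of violators
    raise RuntimeError("no violator-free sample found within max_iters")


# Stand-in problem: the model is the interval spanned by the sample, and a
# point violates it when it falls outside that interval.
rng = random.Random(0)
points = rng.sample(range(100_000), 500)
interval = rand_reweight(
    points,
    2,
    lambda sample: (min(sample), max(sample)),
    lambda model, p: p < model[0] or p > model[1],
    rng,
)
print(interval)
```

Because the loop only stops when no non-sampled point violates the current model, the returned interval spans the whole data set even though each call to `solve` sees only a small sample.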



This work overcomes the above problems using ideas from random projections [141 
|51[T] and randomized algorithms [51 lll|[T? |. As mentioned by the authors of RandSVM, 
the biggest bottleneck in their algorithm is the value of $\Delta$, as it is too large. The main
contribution of this work is, using ideas from random projections, the conjecture that
if RandSVM is solved using $\Delta$ equal to $O(\log n)$, then the solution obtained is close
to optimal with high probability (Theorem 3), particularly for linearly separable and
almost separable data sets. Almost separable data sets are those which become linearly
separable when a small number of properly chosen data points are deleted from them.
The second contribution is an algorithm which, using ideas from randomized algorithms
for Linear Programming (LP), solves the SVM problem by using samples of size linear
in $\Delta$. This work also shows that the theory can be applied to non-linear kernels. The
formulation naturally applies to regression problems.

The paper is organized as follows: Section 2 introduces the previous work, and Section 3
presents the improved algorithm for classification for almost linearly separable data.
Section 4 presents the improved algorithm for the $\varepsilon$-tube regression formulation. We
present our results and conclusions in Sections 5 and 6.
2 Past Work

We begin by reviewing some results from random projections [1]. The data points in
$\mathbb{R}^d$ are projected into a random $k$ dimensional subspace where $k \ll d$. Then, we look
at a few algorithms which focus on large scale classification.



2.1 Random Projection 

The following lemma discusses how the L2 norm of a vector is preserved when it is
projected on a random subspace.

Lemma 1 Let $R = (r_{ij})$ be a random $d \times k$ matrix, such that each entry $r_{ij}$ is chosen
independently according to $N(0, 1)$. For any fixed vector $u \in \mathbb{R}^d$, and any $\epsilon > 0$, let
$u' = \frac{R^\top u}{\sqrt{k}}$. Then $E[\|u'\|^2] = \|u\|^2$ and the following bounds hold:

$$(1 - \epsilon)\|u\|^2 \le \|u'\|^2 \le (1 + \epsilon)\|u\|^2$$

with probability at least $1 - 2e^{-(\epsilon^2 - \epsilon^3)\frac{k}{4}}$.

The following lemma and its corollary show the change in the Euclidean distance
between 2 points and the dot products when they are projected onto a lower dimensional
space [1].

Lemma 2 Let $u, v \in \mathbb{R}^d$. Let $u' = \frac{R^\top u}{\sqrt{k}}$ and $v' = \frac{R^\top v}{\sqrt{k}}$ be the projections of $u$ and $v$
to $\mathbb{R}^k$ via a random matrix $R$ whose entries are chosen independently from $N(0, 1)$ or
$U(-1, 1)$. Then for any $\epsilon > 0$, the following bounds hold:

$$(1 - \epsilon)\|u - v\|^2 \le \|u' - v'\|^2$$

with probability at least $1 - e^{-(\epsilon^2 - \epsilon^3)\frac{k}{4}}$, and

$$\|u' - v'\|^2 \le (1 + \epsilon)\|u - v\|^2$$

with probability at least $1 - e^{-(\epsilon^2 - \epsilon^3)\frac{k}{4}}$.

A corollary of the above lemma shows how well the dot products are preserved upon
projection (this is a slight modification of the corollary given in [1]).

Corollary 1 Let $u, v$ be vectors in $\mathbb{R}^d$ s.t. $\|u\| \le L_1$, $\|v\| \le L_2$. Let $R$ be a random
matrix whose entries are chosen independently from either $N(0, 1)$ or $U(-1, 1)$. Define
$u' = \frac{R^\top u}{\sqrt{k}}$ and $v' = \frac{R^\top v}{\sqrt{k}}$. Then for any $\epsilon > 0$, the following bound holds:

$$u \cdot v - \frac{\epsilon}{2}(L_1^2 + L_2^2) \le u' \cdot v' \le u \cdot v + \frac{\epsilon}{2}(L_1^2 + L_2^2)$$

with probability at least $1 - 4e^{-\frac{\epsilon^2 k}{8}}$.
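Proof For the vectors $u$ and $v$, let the event $E_1$ be

$$(1 - \epsilon)\|u - v\|^2 \le \|u' - v'\|^2 \le (1 + \epsilon)\|u - v\|^2$$

and let the event $E_2$ be

$$(1 - \epsilon)\|u + v\|^2 \le \|u' + v'\|^2 \le (1 + \epsilon)\|u + v\|^2.$$

Then

$$P(E_1 \text{ and } E_2) \ge 1 - 4e^{-(\epsilon^2 - \epsilon^3)\frac{k}{4}}.$$

Now,

$$u' \cdot v' = \tfrac{1}{4}\left(\|u' + v'\|^2 - \|u' - v'\|^2\right) \ge \tfrac{1}{4}\left((1 - \epsilon)\|u + v\|^2 - (1 + \epsilon)\|u - v\|^2\right) = u \cdot v - \tfrac{\epsilon}{2}\left(\|u\|^2 + \|v\|^2\right) \ge u \cdot v - \tfrac{\epsilon}{2}(L_1^2 + L_2^2).$$

The above inequality holds with probability greater than or equal to $1 - 2e^{-(\epsilon^2 - \epsilon^3)\frac{k}{4}}$.
Similarly,

$$u' \cdot v' \le u \cdot v + \tfrac{\epsilon}{2}(L_1^2 + L_2^2)$$

holds with probability greater than or equal to $1 - 2e^{-(\epsilon^2 - \epsilon^3)\frac{k}{4}}$.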
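
The statements of Lemma 1 and Corollary 1 can be checked numerically. The sketch below is a minimal illustration in plain Python, assuming a Gaussian matrix $R$ and the scaling $u' = R^\top u / \sqrt{k}$; the helper names and the sizes `d`, `k`, `eps` are chosen for the demonstration only (here $k > d$ simply to make the concentration visible) and are not taken from the paper.

```python
import math
import random


def gaussian_matrix(d, k, rng):
    """Return a d x k matrix (list of rows) with independent N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]


def project(R, u):
    """Return u' = R^T u / sqrt(k)."""
    d, k = len(R), len(R[0])
    return [sum(R[i][j] * u[i] for i in range(d)) / math.sqrt(k) for j in range(k)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
d, k, eps = 20, 2000, 0.2                      # illustrative sizes only
R = gaussian_matrix(d, k, rng)
u = [rng.uniform(-1.0, 1.0) for _ in range(d)]
v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
u_p, v_p = project(R, u), project(R, v)        # both vectors share the same R

norm_ratio = dot(u_p, u_p) / dot(u, u)         # Lemma 1: close to 1
dot_error = abs(dot(u_p, v_p) - dot(u, v))     # Corollary 1: small additive error
dot_bound = 0.5 * eps * (dot(u, u) + dot(v, v))
print(norm_ratio, dot_error, dot_bound)
```

The additive form of the dot-product bound matters later: the error is controlled by the norms of the vectors, not by the size of the dot product itself.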



2.2 Large scale classification 

We look at a few algorithms which focus on large scale classification. [10] presented
an SVM formulation called Proximal SVM in which the objective is a non-linear least
squares function and the inequality constraints are replaced by a system of equations.
Finding the best separating hyperplane now involves solving this system of equations.
This is done by inverting a $d \times d$ matrix, as a result of which the method is not feasible
for datasets like text for which $d$ is very high. Also, the method involves a matrix
multiplication $H^\top H$, where $H$ is an $n \times (d + 1)$ matrix. So the entire data matrix needs
to be kept in memory and hence the method is not scalable in terms of memory.

[15] presented an algorithm, L2-SVM-MFN, which uses a conjugate gradient method
to solve the SVM problem and thus, unlike the previous method, does not have to
perform any matrix inversion. Results in their paper indicate that the algorithm performs very
well for large high dimensional datasets like text. Analysis of the algorithm indicates
that it accesses the data vectors in a sequential manner and hence does not have to
keep the data matrix in main memory, making it scalable in terms of memory.

Our work is closely related to 213] • They propose that d be used as the combina- 
torial dimension of the problem for the separable case. The dual of the SVM problem, 
when the data is linearly separable, is the minimum distance between the 2 convex hulls 
of the positive and negative examples. When the data is not linearly separable, these 2 
hulls overlap. This can be reduced to the separable case by condensing the 2 hulls [5]. This
is done as follows. Let $Z$ be the set of composed examples $z_j$, where $z_j = \frac{1}{m}\sum_{k=1}^{m} x_{i_k}$,
where each $x_{i_k}$ is a distinct element of $D$, all the points defining a $z_j$ have the same
label, and the label of $z_j$ is that same label (for details on this condensation, see their paper).
In this case, we have $|Z| \le \binom{n}{m}$ and $(m + 1) \le \Delta \le m(d + 1)$. It is this aspect of the
SVM problem which was used by the authors to develop a randomized algorithm to
solve the problem, given in Algorithm 1.

The algorithm proceeds in multiple iterations, where in each iteration it picks
a subset $S$ of the training data, such that the size of the subset, $r$, is greater than
the number of support vectors. Any SVM solver can be used to train a classifier $C$
on the sampled subset, which is smaller than the entire data. Based on the classifier
$C$ obtained, the sampling probabilities are changed for the training data such that in
successive iterations, the support vectors have a higher probability of selection. This
process is repeated until the number of misclassified documents $v = 0$. The termination
of the algorithm is guaranteed in a probabilistic fashion in [8]. The authors recommend
using $m(d + 1)$ as an estimate of $\Delta$. This choice of $\Delta$ makes the subset size too large
for high dimensional datasets, making it impractical.

To overcome this problem we use ideas from random projections [14ll9l[T]. Consider 
projecting the data points into a random $k$ dimensional subspace where $k \ll d$. Using
this idea, we give a theoretical bound on the combinatorial dimension $\Delta$ which is much
smaller than the original data dimension $d$, in the almost linearly separable case. In
practice, it has been observed that $\Delta$ is even lower. We then apply this to make the
above algorithm scalable (without actually performing any random projection of the
data).

3 Classification 

This section uses results from random projections and randomized algorithms for linear
programming to develop a new algorithm for solving large scale SVM classification
problems. In Section 3.1 we discuss the case of linearly separable data and estimate
the number of support vectors required such that the margin is preserved with high
probability, and show that this number is much smaller than the data dimension $d$,
using ideas from random projections. In Section 3.2 we look at how the analysis applies
to almost separable data and present the main result of the paper (Theorem 3). The
section ends with a discussion on the application of the theory to non-linear kernels.
In Section 3.3 we present the randomized algorithm for SVM learning.

3.1 Linearly separable data 

We start with determining the dimension $k$ of the target space such that on
performing a random projection to the space, the Euclidean distances and dot products
are preserved. The appendix contains a few results from random projections which
will be used in this section. For a linearly separable data set $D = \{(x_i, y_i),\ i = 1, \ldots, n\}$,
$x_i \in \mathbb{R}^d$, $y_i \in \{+1, -1\}$, the C-SVM formulation is the same as C-SVM-1
with $\xi_i = 0$, $i = 1, \ldots, n$. By dividing all the constraints by $\|w\|$, the problem can be
reformulated as follows:
C-SVM-2a:

$$\max_{(\hat{w}, \hat{b}, l)}\ l \quad \text{s.t.}\quad y_i(\hat{w} \cdot x_i + \hat{b}) \ge l,\ i = 1, \ldots, n,\quad \|\hat{w}\| = 1$$

where $\hat{w} = \frac{w}{\|w\|}$, $\hat{b} = \frac{b}{\|w\|}$ and $l = \frac{1}{\|w\|}$. Here $l$ is the margin induced by the separating
hyperplane, that is, it is the distance between the 2 supporting hyperplanes.

The determination of $k$ proceeds as follows. First, for any given value of $k$, we show
the change in the margin as a function of $k$ when the data points are projected onto
the $k$ dimensional subspace and the problem is solved. From this, we determine the value
of $k$ ($k \ll d$) which will preserve the margin with a very high probability. In a $k$ dimensional
subspace, there are at most $k + 1$ support vectors. Using the idea of orthogonal
extensions (the definition appears later in this section), we prove that when the problem
is solved in the original space, using an estimate of $k + 1$ on the number of support
vectors, the margin is preserved with a very high probability.

Let $w'$ and $x'_i$, $i = 1, \ldots, n$ be the projections of $w$ and $x_i$, $i = 1, \ldots, n$ respectively
onto a $k$ dimensional subspace (as in Lemma 2). The classification problem in the
projected space with the data set being $D' = \{(x'_i, y_i),\ i = 1, \ldots, n\}$, $x'_i \in \mathbb{R}^k$, $y_i \in \{+1, -1\}$
can be written as follows:
C-SVM-2b:

$$\text{Maximize}_{(w', b, l')}\ l' \quad \text{Subject to:}\quad y_i(w' \cdot x'_i + b) \ge l',\ i = 1, \ldots, n,\quad \|w'\| \le 1$$

where $l' = l(1 - \gamma)$, $\gamma$ is the distortion and $0 < \gamma < 1$. The following theorem predicts,
for a given value of $\gamma$, the $k$ such that the margin is preserved with a high probability
upon projection.

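Theorem 1 Let $L = \max_i \|x_i\|$, and $(w^*, b^*, l^*)$ be the optimal solution for C-SVM-2a.
Let $R$ be a random $d \times k$ matrix as given in Lemma 2. Let $w' = \frac{R^\top w^*}{\sqrt{k}}$ and
$x'_i = \frac{R^\top x_i}{\sqrt{k}}$, $i = 1, \ldots, n$. If $k \ge \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta}$, $0 < \gamma < 1$, $0 < \delta < 1$, then
the following bound holds on the optimal margin $l_P$ obtained by solving the problem
C-SVM-2b:

$$P\left(l_P \ge l^*(1 - \gamma)\right) \ge 1 - \delta$$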

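Proof From Corollary 1 of Lemma 2, we have

$$w^* \cdot x_i - \frac{\epsilon}{2}(1 + L^2) \le w' \cdot x'_i \le w^* \cdot x_i + \frac{\epsilon}{2}(1 + L^2)$$

which holds with probability at least $1 - 4e^{-\frac{\epsilon^2 k}{8}}$ for some $\epsilon > 0$. Consider some example
$x_i$ with $y_i = 1$. Then the following holds with probability at least $1 - 2e^{-\frac{\epsilon^2 k}{8}}$:

$$w' \cdot x'_i + b^* \ge w^* \cdot x_i - \frac{\epsilon}{2}(1 + L^2) + b^* \ge l^* - \frac{\epsilon}{2}(1 + L^2)$$

Dividing the above by $\|w'\|$, we have

$$\frac{w' \cdot x'_i + b^*}{\|w'\|} \ge \frac{l^* - \frac{\epsilon}{2}(1 + L^2)}{\|w'\|}$$

Note that from Lemma 1 we have $\sqrt{1 - \epsilon}\,\|w^*\| \le \|w'\| \le \sqrt{1 + \epsilon}\,\|w^*\|$, with
probability at least $1 - 2e^{-(\epsilon^2 - \epsilon^3)\frac{k}{4}}$. Since $\|w^*\| = 1$, we have $\sqrt{1 - \epsilon} \le \|w'\| \le \sqrt{1 + \epsilon}$.
Hence

$$\frac{w' \cdot x'_i + b^*}{\|w'\|} \ge \frac{l^* - \frac{\epsilon}{2}(1 + L^2)}{\sqrt{1 + \epsilon}} \ge \left(l^* - \frac{\epsilon}{2}(1 + L^2)\right)\sqrt{1 - \epsilon} \ge l^*\left(1 - \frac{\epsilon}{2l^*}(1 + L^2)\right)(1 - \epsilon) \ge l^*\left(1 - \epsilon\left(1 + \frac{1 + L^2}{2l^*}\right)\right)$$

This holds with probability at least $1 - 4e^{-\frac{\epsilon^2 k}{8}}$. A similar result can be derived for a
point $x_j$ for which $y_j = -1$. The above analysis guarantees that by projecting onto a $k$
dimensional space, there exists at least one hyperplane $\left(\frac{w'}{\|w'\|}, \frac{b^*}{\|w'\|}\right)$ which guarantees
a margin of $l^*(1 - \gamma)$, where

$$\gamma = \epsilon\left(1 + \frac{1 + L^2}{2l^*}\right) \qquad (4)$$

with probability at least $1 - n4e^{-\frac{\epsilon^2 k}{8}}$. The margin obtained by solving the problem
C-SVM-2b, $l_P$, can only be better than this. So the value of $k$ is given by:

$$n4e^{-\frac{\gamma^2}{\left(1 + \frac{1 + L^2}{2l^*}\right)^2}\frac{k}{8}} \le \delta \implies k \ge \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta} \qquad (5)$$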
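
The bound on $k$ in Theorem 1 is easy to evaluate. The following sketch, a hedged illustration and not part of the paper, computes the smallest integer satisfying $k \ge \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta}$; the function name `projection_dimension` and the parameter values are hypothetical.

```python
import math


def projection_dimension(n, L, l_star, gamma, delta):
    """Smallest integer k with
    k >= (8 / gamma^2) * (1 + (1 + L^2) / (2 * l_star))^2 * log(4 * n / delta).
    """
    if not (0 < gamma < 1 and 0 < delta < 1):
        raise ValueError("gamma and delta must lie in (0, 1)")
    factor = 1.0 + (1.0 + L * L) / (2.0 * l_star)
    return math.ceil(8.0 / gamma ** 2 * factor ** 2 * math.log(4.0 * n / delta))


# Hypothetical parameter values: unit-norm data with unit margin.
k_small = projection_dimension(n=10**4, L=1.0, l_star=1.0, gamma=0.5, delta=0.1)
k_large = projection_dimension(n=10**8, L=1.0, l_star=1.0, gamma=0.5, delta=0.1)
print(k_small, k_large)
```

The value grows only logarithmically in $n$ and does not depend on the data dimension $d$, which is the property the rest of the section relies on.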



So by randomly projecting the points onto a $k$ dimensional subspace, the margin
is preserved with a high probability. This result is similar to the results in large scale
learning using random projections [1][2]. But there are fundamental differences between
the method proposed in this paper and the previous methods: no random projection
is actually done here, and no black box access to the data distribution is required. We
use Theorem 1 to determine an estimate on the number of support vectors such that the
margin is preserved with a high probability, when the problem is solved in the original
space. This is given in Theorem 2 and is the main contribution of this section. The
theorem is based on the following fact: in a $k$ dimensional space, the number of support
vectors is upper bounded by $k + 1$. We show that this $k + 1$ can be used as an estimate
of the number of support vectors in the original space such that the solution obtained
preserves the margin with a high probability. We start with the following definition.

Definition 2 (Orthogonal extension) An orthogonal extension of a $(k-1)$-dimensional
flat (a $(k-1)$-dimensional flat is a $(k-1)$-dimensional affine space) $h_p = (w_p, b)$,
where $w_p = (w_1, \ldots, w_k)$, in a subspace $S_k$ of dimension $k$ to a $(d-1)$-dimensional
hyperplane $h = (\tilde{w}, b)$ in $d$-dimensional space, is defined as follows. Let $R \in \mathbb{R}^{d \times d}$
be a random projection matrix as in Lemma 2. Let $\hat{R} \in \mathbb{R}^{d \times k}$ be another random
projection matrix which consists of only the first $k$ columns of $R$. Let $\hat{x}_i = R^\top x_i$
and $x'_i = \hat{R}^\top x_i$. Let $w_p = (w_1, \ldots, w_k)$ be the optimal hyperplane classifier with
margin $l_p$ for the points $x'_1, \ldots, x'_n$ in the $k$ dimensional subspace. Now define $\tilde{w}$ to be
all 0's in the last $d - k$ coordinates and identical to $w_p$ in the first $k$ coordinates,
that is, $\tilde{w} = (w_1, \ldots, w_k, 0, \ldots, 0)$. Orthogonal extensions have the following key
property. If $(w_p, b)$ is a separator with margin $l_p$ for the projected points, then its
orthogonal extension $(\tilde{w}, b)$ is a separator with margin $l_p$ for the original points, that is, if
$y_i(w_p \cdot x'_i + b) \ge l$, $i = 1, \ldots, n$, then $y_i(\tilde{w} \cdot \hat{x}_i + b) \ge l$, $i = 1, \ldots, n$.
An important point to note, which will be required when extending orthogonal
extensions to non-linear kernels, is that dot products between the points are preserved upon
doing orthogonal projections, that is, $\hat{x}_i^\top \hat{x}_j = x_i^\top x_j$.
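
The key property of orthogonal extensions reduces to a one-line computation: padding the $k$-dimensional normal vector with zeros leaves its dot product with any point unchanged, as long as the point's first $k$ coordinates are the projected coordinates. The sketch below illustrates this with made-up numbers; the function name `orthogonal_extension` is hypothetical.

```python
def orthogonal_extension(w_p, d):
    """Pad a k-dimensional normal vector with zeros to obtain a d-dimensional one."""
    if len(w_p) > d:
        raise ValueError("k must not exceed d")
    return list(w_p) + [0.0] * (d - len(w_p))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x_hat = [0.5, -1.0, 2.0, 3.0, -4.0]   # a point in d = 5 dimensions (made up)
x_proj = x_hat[:2]                    # its first k = 2 coordinates
w_p = [1.5, -0.5]                     # a separator normal in the k-dim subspace
w_ext = orthogonal_extension(w_p, len(x_hat))

# The decision value is identical in the subspace and in the full space.
print(dot(w_p, x_proj), dot(w_ext, x_hat))
```

Since the decision values coincide for every point, any margin achieved by $(w_p, b)$ on the projected points is achieved by the extension on the full-dimensional points.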

Let $L$, $l^*$, $\gamma$, $\delta$ and $n$ be as defined in Theorem 1. The following is the main result
of this section.

Theorem 2 Given $k \ge \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta}$ and $n$ training points with maximum
norm $L$ in $d$ dimensional space and separable by a hyperplane with margin $l^*$, there
exists a subset of $k'$ training points $x'_1, \ldots, x'_{k'}$ where $k' \le k$ and a hyperplane $h$ satisfying
the following conditions:

1. $h$ has margin at least $l^*(1 - \gamma)$ with probability at least $1 - \delta$
2. $x'_1, \ldots, x'_{k'}$ are the only training points which lie either on $h_1$ or on $h_2$

Proof Let $w^*, b^*$ denote the normal to a separating hyperplane with margin $l^*$, that is,
$y_i(w^* \cdot x_i + b^*) \ge l^*$ for all $x_i$ and $\|w^*\| = 1$. Consider a random projection of $x_1, \ldots, x_n$
to a $k$ dimensional space and let $w', z_1, \ldots, z_n$ be the projections of $w^*, x_1, \ldots, x_n$,
respectively, scaled by $1/\sqrt{k}$. By Theorem 1, $y_i\left(\frac{w' \cdot z_i}{\|w'\|} + \frac{b^*}{\|w'\|}\right) \ge l^*(1 - \gamma)$ holds for
all $z_i$ with probability at least $1 - \delta$. Let $h$ be the orthogonal extension of $\left(\frac{w'}{\|w'\|}, \frac{b^*}{\|w'\|}\right)$
to the full $d$ dimensional space. Then $h$ has margin at least $l^*(1 - \gamma)$, as required. This
shows the first part of the claim.

To prove the second part, consider the projected training points which lie on either
of the two supporting hyperplanes. Barring degeneracies, there are at most $k$ such
points. Clearly, these will be the only points which lie on the orthogonal extension $h$,
by definition. □

From the above analysis, it is seen that if $k \ll d$, then we can estimate that the
number of support vectors is $k + 1$, and the algorithm RandSVM would take on average
$O(k \log n)$ iterations to solve the problem [4][3].



3.2 Almost separable data 

In this section, we look at how the above analysis can be applied to almost separable
data sets. We call a data set almost separable if by removing a fraction $\kappa = O\left(\frac{\log n}{n}\right)$
of the points, the data set becomes linearly separable.

The C-SVM formulation when the data is not linearly separable (and almost
separable) was given in C-SVM-1. This problem can be reformulated as follows:

$$\text{Minimize}_{(w, b, \xi)}\ \sum_{i=1}^{n} \xi_i \quad \text{Subject to:}\quad y_i(w \cdot x_i + b) \ge l - \xi_i,\ \ \xi_i \ge 0,\ i = 1, \ldots, n;\quad \|w\| \le 1$$

This formulation is known as the Generalized Optimal Hyperplane formulation. Here $l$
depends on the value of $C$ in the C-formulation. At optimality, the margin $l^* = l$. The
following theorem proves a result for almost separable data similar to the one proved
in Theorem 2 for separable data.
Theorem 3 Given $k \ge \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta} + \kappa n$, $l^*$ being the margin at optimality,
$l$ the lower bound on $l^*$ as in the Generalized Optimal Hyperplane formulation and
$\kappa = O\left(\frac{\log n}{n}\right)$, there exists a subset of $k'$ training points $x'_1, \ldots, x'_{k'}$, $k' \le k$, and a hyperplane
$h$ satisfying the following conditions:

1. $h$ has margin at least $l(1 - \gamma)$ with probability at least $1 - \delta$
2. At most $\frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta}$ points lie on the planes $h_1$ or $h_2$
3. $x'_1, \ldots, x'_{k'}$ are the only points which define the hyperplane $h$, that is, they are the
support vectors of $h$.

Proof Let the optimal solution for the generalized optimal hyperplane formulation be
$(w^*, b^*, \xi^*)$, with $w^* = \sum_{i:\alpha_i>0} \alpha_i y_i x_i$ and $l^* = \frac{1}{\|w^*\|}$ as mentioned before. The set of support
vectors can be split into 2 disjoint sets, $SV_1 = \{x_i : \alpha_i > 0 \text{ and } \xi_i^* = 0\}$ (unbounded
SVs) and $SV_2 = \{x_i : \alpha_i > 0 \text{ and } \xi_i^* > 0\}$ (bounded SVs).

Now, consider removing the points in $SV_2$ from the data set. Then the data set
becomes linearly separable with margin $l^*$. Using an analysis similar to Theorem 1
and the fact that $l^* \ge l$, we have the proof for the first 2 conditions.

When all the points in $SV_2$ are added back, at most all these points are added to
the set of support vectors and the margin does not change; this is guaranteed by the
fact that we have assumed the worst possible margin for proving conditions 1 and 2,
and any value lower than this would violate the constraints of the problem. This proves
condition 3. □

Hence the number of support vectors, such that the margin is preserved with high
probability, is

$$k \ge \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta} + \kappa n = \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta} + O(\log n) \qquad (6)$$

Using a non-linear kernel: Consider a mapping function $\Phi : \mathbb{R}^d \to \mathbb{R}^{d'}$, $d' > d$, which
maps a point $x_i \in \mathbb{R}^d$ to a point $z_i \in \mathbb{R}^{d'}$, where $\mathbb{R}^{d'}$ is a Euclidean space. Let the
points $z_1, \ldots, z_n$ be projected onto a random $k$ dimensional subspace as before. The
lemmas in the appendix are applicable to these random projections [2]. The orthogonal
extensions can be considered as a projection from the $k$ dimensional space to the
$\Phi$-space, such that the kernel function values are preserved. Then it can be shown that
Theorem 3 also applies when using non-linear kernels.



3.3 A Randomized Algorithm 

The reduction in the sample size from $6d^2$ to $6k^2$ is not enough to make RandSVM
useful in practice, as $6k^2$ is still a large number. This section presents another randomized
algorithm which only requires that the sample size be greater than the number
of support vectors. Hence a sample size linear in $k$ can be used in the algorithm. This
algorithm was first proposed to solve large scale LP problems [17]; it has been adapted
for solving large scale SVM problems.
Algorithm 2 RandSVM-1($D$, $k$, $r$)
Require: $D$ - The data set.
Require: $k$ - The estimate of the number of support vectors.
Require: $r$ - Sample size $= ck$, $c > 0$.
1: $S$ = randomsubset($D$, $r$); // Pick a random subset, $S$, of size $r$ from the data set $D$
2: $SV$ = svmlearn({}, $S$); // $SV$ - set of support vectors obtained by solving the problem $S$
3: $V = \{x \in D - S \mid \text{violates}(x, SV)\}$ // violator - nonsampled point not satisfying KKT conditions
4: while ($|V| > 0$ and $|SV| < k$) do
5:   $R$ = randomsubset($V$, $r - |SV|$); // Pick a random subset from the set of violators
6:   $SV'$ = svmlearn($SV$, $R$); // $SV'$ - set of support vectors obtained by solving the problem $SV \cup R$
7:   $SV = SV'$;
8:   $V = \{x \in D - (SV \cup R) \mid \text{violates}(x, SV)\}$; // Determine violators from nonsampled set
9: end while
10: return $SV$
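
The control flow of Algorithm 2 can be sketched in Python. The version below is a minimal illustration under strong simplifying assumptions and is not the authors' code: the solver `svmlearn_1d` is a hypothetical stand-in that solves the hard-margin SVM exactly for separable points on a line (where the support vectors are the largest negative and the smallest positive point), the initial sample is redrawn until both labels appear, and `k = 3` is used because this toy problem always has exactly 2 support vectors.

```python
import random


def svmlearn_1d(points):
    """Hard-margin SVM on a line, assuming negatives lie left of positives.
    Returns the support vectors: the largest negative and smallest positive point."""
    q = max(x for x, y in points if y == -1)
    p = min(x for x, y in points if y == +1)
    return {(q, -1), (p, +1)}


def violates(point, sv):
    """True if the point breaks y * (w * x + b) >= 1 for the classifier defined by sv."""
    q = next(x for x, y in sv if y == -1)
    p = next(x for x, y in sv if y == +1)
    w = 2.0 / (p - q)                  # chosen so that w*q + b = -1 and w*p + b = +1
    b = -(p + q) / (p - q)
    x, y = point
    return y * (w * x + b) < 1.0 - 1e-12


def rand_svm_1(D, k, r, rng):
    """Sketch of Algorithm 2 with the 1-D solver above plugged in."""
    while True:                        # assumption: redraw until both labels appear
        S = rng.sample(D, r)
        if {y for _, y in S} == {-1, +1}:
            break
    SV = svmlearn_1d(S)
    sampled = set(S)
    V = [p for p in D if p not in sampled and violates(p, SV)]
    while V and len(SV) < k:
        R = rng.sample(V, min(len(V), r - len(SV)))   # random subset of violators
        SV = svmlearn_1d(list(SV) + R)                # solve the problem SV u R
        used = SV | set(R)
        V = [p for p in D if p not in used and violates(p, SV)]
    return SV


rng = random.Random(1)
neg = [(x, -1) for x in rng.sample(range(-100_000, -10), 1000)]
pos = [(x, +1) for x in rng.sample(range(10, 100_000), 1000)]
D = neg + pos
SV = rand_svm_1(D, k=3, r=20, rng=rng)
print(sorted(SV))
```

Each pass through the loop adds at least one violator to the problem, so the margin of the intermediate classifier shrinks strictly until no violators remain, mirroring the convergence argument that follows.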



Proof of Convergence: Let $SV$ be the current set of support vectors. The condition $|SV| < k$
comes from Theorem 3. Hence if the condition is violated, then the algorithm
terminates with a solution which is near optimal with a very high probability.
Now consider the case where $|SV| < k$ and $|V| > 0$. Let $x_i$ be a violator ($x_i$ is a
non-sampled point such that $y_i(w^\top x_i + b) < 1$). Since SVM is an instance of AOP, solving
the problem with the set of constraints $SV \cup \{x_i\}$ will only result in an
increase (decrease) of the objective function of the primal (dual). As there are only a finite
number of bases for an AOP, the algorithm is bound to terminate; also, if
termination happens with the number of violators equal to zero, then the solution obtained is
optimal.

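Determination of k: The value of $k$ depends on the margin $l^*$, which is not available in
the case of C-SVM. This can be handled only by solving for $k$ as a function of $\epsilon$, where $\epsilon$ is
as defined in the appendix and Theorem 1. This can be done by combining Equation (4)
with Equation (6):

$$k \ge \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta} + O(\log n) \ge \frac{8}{\gamma^2}\left(1 + \frac{1 + L^2}{2l^*}\right)^2 \log\frac{4n}{\delta} \ge \frac{8}{\epsilon^2} \log\frac{4n}{\delta} \qquad (7)$$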

4 Regression 

Let us define a dataset $D = \{(x_i, y_i) \mid 1 \le i \le n,\ x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R}\}$ to be linear, for a
fixed $\varepsilon > 0$, if the following formulation is feasible.
SVR-1:

$$\min_{w, b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to:}\quad y_i - w \cdot x_i - b \le \varepsilon,\quad w \cdot x_i + b - y_i \le \varepsilon$$

This is the SVM regression formulation in which $D$ is constrained to lie in an $\varepsilon$-tube.
The Lagrangian is given as $\mathcal{L}(w, b, \alpha^+, \alpha^-) = \tfrac{1}{2}w^\top w + \sum_i \alpha_i^+(y_i - w \cdot x_i - b - \varepsilon) + \sum_i \alpha_i^-(w \cdot x_i + b - y_i - \varepsilon)$.
By the KKT conditions, the optimal solution will have $w^* = \sum_i (\alpha_i^+ - \alpha_i^-)x_i$.
The set of support vectors is the union of two disjoint sets given as
$\{i : \alpha_i^+ > 0\} \cup \{i : \alpha_i^- > 0\}$. We would like to develop randomized algorithms which
can solve such problems where $d$ and $n$ are large.
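
The two SVR-1 constraints say that every residual lies within the tube. The sketch below checks this for a given candidate regressor; it is a minimal illustration with a hypothetical function name `in_eps_tube` and made-up data. Note that it tests one candidate $(w, b)$, whereas the definition of a linear data set only requires that some feasible pair exists.

```python
def in_eps_tube(data, w, b, eps):
    """True if every (x, y) satisfies both SVR-1 constraints:
    y - w.x - b <= eps and w.x + b - y <= eps."""
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y - pred > eps or pred - y > eps:
            return False
    return True


# Made-up 1-D data scattered around the line y = 2x + 1.
data = [([0.0], 1.0), ([1.0], 3.25), ([2.0], 4.75)]
print(in_eps_tube(data, [2.0], 1.0, eps=0.25))   # residuals are 0, 0.25, -0.25
print(in_eps_tube(data, [2.0], 1.0, eps=0.2))
```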

Let $x'_i$ be the projection of $x_i$, $i = 1, \ldots, n$ onto a $k$-dimensional subspace. The
regression problem in the projected space is given by
SVR-2:

$$\min_{w', b}\ \tfrac{1}{2}\|w'\|^2 \quad \text{subject to:}\quad y_i - w' \cdot x'_i - b \le \varepsilon',\quad w' \cdot x'_i + b - y_i \le \varepsilon'$$

where $\varepsilon' = \varepsilon(1 + \gamma)$; $\gamma$ is the distortion. The following theorem predicts the value of
$k$ such that the $\varepsilon$-tube is preserved, with a minor distortion, with a high probability
upon projection.

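Theorem 4 Let $L = \max_i \|x_i\|$, and $(w^*, b^*)$, $\|w^*\| = W$, be the optimal solution for
SVR-1. Let $R$ be a random $d \times k$ matrix as given in Lemma 2. Let $w' = \frac{R^\top w^*}{\sqrt{k}}$ and
$x'_i = \frac{R^\top x_i}{\sqrt{k}}$, $i = 1, \ldots, n$. If $k \ge \frac{2(W^2 + L^2)^2}{\gamma^2\varepsilon^2} \log\frac{4n}{\delta}$, $0 < \delta < 1$, then the following bound
holds on the optimal regressor $(w_p, b_p)$ obtained by solving the problem SVR-2:

$$P\left(|w_p \cdot x'_i + b_p - y_i| \le \varepsilon(1 + \gamma)\right) \ge 1 - \delta$$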

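Proof From Corollary 1 of Lemma 2 we have, for a projection distortion $\epsilon > 0$:

$$w^* \cdot x_i - \frac{\epsilon}{2}(W^2 + L^2) \le w' \cdot x'_i \le w^* \cdot x_i + \frac{\epsilon}{2}(W^2 + L^2)$$

which holds with probability at least $1 - 4e^{-\frac{\epsilon^2 k}{8}}$. So,

$$w' \cdot x'_i + b^* - y_i \le w^* \cdot x_i + b^* - y_i + \frac{\epsilon}{2}(W^2 + L^2) \le \varepsilon + \frac{\epsilon}{2}(W^2 + L^2) = \varepsilon\left(1 + \frac{\epsilon}{2\varepsilon}(W^2 + L^2)\right)$$

holds with probability at least $1 - 2e^{-\frac{\epsilon^2 k}{8}}$. Similarly

$$y_i - w' \cdot x'_i - b^* \le \varepsilon + \frac{\epsilon}{2}(W^2 + L^2)$$

holds with probability at least $1 - 2e^{-\frac{\epsilon^2 k}{8}}$. The above analysis guarantees that upon
projection onto a $k$-dimensional plane, there exists $(w', b^*)$ which guarantees an $\varepsilon$-tube
of $\varepsilon(1 + \gamma)$, where

$$\gamma = \frac{\epsilon}{2\varepsilon}(W^2 + L^2)$$

with probability at least $1 - 4e^{-\frac{\epsilon^2 k}{8}}$ for each point. Taking a union bound over the $n$ points,
the value of $k$ is given by:

$$n4e^{-\frac{\epsilon^2 k}{8}} \le \delta \implies k \ge \frac{2(W^2 + L^2)^2}{\gamma^2\varepsilon^2} \log\frac{4n}{\delta}$$

So, upon projection, there exists a regressor which preserves the $\varepsilon$-tube with a high
probability. The regressor obtained by solving SVR-2 can only do better than this.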
Let $L$, $W$, $\delta$, $\gamma$, $n$, $\varepsilon$ be as defined in Theorem 4 and $(w, b)$ be the orthogonal extension
of $(w', b)$ to $\mathbb{R}^d$ as in Definition 2. Then, we get:

Theorem 5 Given $k \ge \frac{2(W^2 + L^2)^2}{\gamma^2\varepsilon^2} \log\frac{4n}{\delta}$ and $n$ training points with maximum norm
$L$ in $d$ dimensional space for which the SVR-1 problem with margin $\varepsilon$ has a solution,
there exists a subset of $k'$ training points $x'_1, \ldots, x'_{k'}$ where $k' \le k$ and
$h = \{(x, y) \mid y - w \cdot x - b = \varepsilon\} \cup \{(x, y) \mid w \cdot x + b - y = \varepsilon\}$ satisfying the following conditions:

1. $(w, b)$ is the solution to an SVR-1 problem with margin at most $\varepsilon(1 + \gamma)$.
2. $x'_1, \ldots, x'_{k'}$ are the only training points which are in $h$.

Proof Let $w^*, b^*$ denote the optimal regressor for problem SVR-1 with margin $\varepsilon$, that
is, $w^* \cdot x_i + b^* - y_i \le \varepsilon$ and $y_i - w^* \cdot x_i - b^* \le \varepsilon$ for all $x_i$. Let $w'$ and $x'_i$ be the random
projections of $w^*$ and $x_i$ as outlined in Theorem 4. Then, $|w' \cdot x'_i + b^* - y_i| \le \varepsilon(1 + \gamma)$
with probability at least $(1 - \delta)$. Let $(w, b^*)$ be the orthogonal extension of $(w', b^*)$ to
the full $d$ dimensional space. Then

$$|w \cdot x_i + b^* - y_i| = |w' \cdot x'_i + b^* - y_i| \le \varepsilon(1 + \gamma)$$

Therefore, $(w, b^*)$ is a solution to SVR-1 with margin at most $\varepsilon(1 + \gamma)$.

To prove the second part, consider the projected training points which lie on
$h' = \{(x', y) \mid y - w' \cdot x' - b^* = \varepsilon(1 + \gamma)\} \cup \{(x', y) \mid w' \cdot x' + b^* - y = \varepsilon(1 + \gamma)\}$. Barring
degeneracies, there are at most $k$ such points. Clearly, these will be the only points
which lie on the orthogonal extension $h$, by definition.

Consider the problem SVR-3:

$$\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to:}\quad y_i - w \cdot x_i - b \le \varepsilon + \xi_i,\quad w \cdot x_i + b - y_i \le \varepsilon + \xi_i,\quad \xi_i \ge 0$$

Analogous to the notion of almost separability in the context of classification, we
define the notion of almost linear as follows: the data set $D = \{(x_i, y_i)\}_{i=1}^{n}$ is almost
linear if by removing a fraction $\kappa = O\left(\frac{\log n}{n}\right)$ of the points, there exists a solution to the
SVR-1 problem for some chosen $\varepsilon > 0$. The problem SVR-3 is almost linear if the
optimal solution $(w^*, b^*, \xi^*)$ has the cardinality of the set $\{i : \xi_i^* > 0\}$ as $O(\log n / n)$.
The next theorem presents the result for almost separable data sets for regression.

Theorem 6 Given k > 2^ ^^^ ^ log ^ + nn and n training points with maximum 
norm L in d dimensional space for which the SVR — 3 problem with margin e has an 
almost separable optimal solution, there exists a subset of k' training points xi . . .x'^ 
where k' < k and h — {{x, y)\y — w ■ x — b — e} [J{{x, y)\w ■ x + b ~ y = e} satisfying 
the following conditions: 

1. {w,b,S,) IS the solution to a hard e—tube regression problem with margin e(l +7). 

2. At the most 2 ^ ^ ^ log ^ points lie on the plane h. 

3. xi . . . x'^. are the only training points which lie on h. 
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Proof Let the optimal solution for the SVR-3 formulation be {w* , b* , * ) . The set 
of support vectors can be split into to 2 disjoint sets,SVi — {xi : a; > and — 
0} (unbounded SVs) and SV2 = {x^ : at > and ^* > 0} (bounded SVs). 

Now, consider removing the points in $SV_2$ from the data set. Then the data set
becomes linearly separable. Using an analysis similar to Theorem 4, we have the proof
for the first 2 conditions.

When all the points in $SV_2$ are added back, at most all these points are added to
the set of support vectors and the margin $\epsilon(1 + \gamma)$ does not change; this is guaranteed
by the fact that we have assumed the worst possible margin for proving conditions 1
and 2, and any value lower than this would violate the constraints of the problem. This
proves condition 3.

5 Experiments 

5.1 Classification 

This section discusses the performance of RandSVM in practice. The experiments were
performed on 4 data sets: 3 synthetic and 1 real world. RandSVM was used with
LibSVM as the solver when using a non-linear kernel, and with SVMLight for a linear
kernel. RandSVM has been compared with state-of-the-art SVM solvers: LibSVM [7]
for non-linear kernels, and SVMPerf and SVMLin for linear kernels.

5.1.1 Synthetic data sets 

The twonorm data set is a 2 class problem where each class is drawn from a multivariate
normal distribution with unit variance. Each vector is a 20 dimensional vector. One
class has mean $(a, a, \ldots, a)$, and the other class has mean $(-a, -a, \ldots, -a)$, where
$a = 2/\sqrt{20}$. The ringnorm data set is a 2 class problem with each vector consisting of
20 dimensions. Each class is drawn from a multivariate normal distribution. One class
has mean 1, and covariance 4 times the identity. The other class has mean $(a, a, \ldots, a)$,
and unit covariance where $a = 2/\sqrt{20}$.

The checkerboard data set consists of vectors in a 2 dimensional space. The points
are generated in a 4 x 4 grid. Both the classes are generated from a multivariate
uniform distribution; each point is $(x_1 \sim U(0, 4), x_2 \sim U(0, 4))$. The points are labeled
as follows: if $(x_1 \% 2 = x_2 \% 2)$, then the point is labeled negative, else the point is
labeled positive. For each of the synthetic data sets, a training set of 1,000,000 points
and a test set of 10,000 points was generated. A smaller subset of 100,000 points was
chosen from the training set for parameter tuning. From now on, the smaller training set
will have a subscript of 1 and the larger training set will have a subscript of 2, for
example, ringnorm1 and ringnorm2.
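
The three generators described above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions: the value $a = 2/\sqrt{20}$ follows the usual definition of these benchmark sets, "mean 1" is read as the all-ones mean vector, the assignment of class labels to distributions is arbitrary, and the checkerboard coordinates are assumed to be floored to integer cell indices before the parity test. None of the function names come from the paper.

```python
import math
import random

DIM = 20
A = 2 / math.sqrt(DIM)  # assumed value of a (the usual choice for these benchmarks)


def twonorm(rng):
    """Unit-variance Gaussian with mean (a, ..., a) or (-a, ..., -a)."""
    label = rng.choice([1, -1])
    return [rng.gauss(label * A, 1.0) for _ in range(DIM)], label


def ringnorm(rng):
    """One class: mean 1, covariance 4I (std 2); other: mean (a, ..., a), covariance I."""
    label = rng.choice([1, -1])
    mean, std = (1.0, 2.0) if label == 1 else (A, 1.0)
    return [rng.gauss(mean, std) for _ in range(DIM)], label


def checkerboard_label(x1, x2):
    # Assumption: coordinates are floored to integer cell indices before "% 2".
    return -1 if int(x1) % 2 == int(x2) % 2 else 1


def checkerboard(rng):
    """Uniform point in the 4 x 4 grid with its checkerboard label."""
    x1, x2 = rng.uniform(0, 4), rng.uniform(0, 4)
    return [x1, x2], checkerboard_label(x1, x2)


rng = random.Random(0)
train = [checkerboard(rng) for _ in range(1000)]
```

Replacing `checkerboard` with `twonorm` or `ringnorm` in the last line yields samples for the other two data sets.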

5.1.2 Real world data set 

The RCV1 [16] data set consists of 804,414 documents, with each document consisting
of 47,236 features. Experiments were performed using 2 categories of the data set - 

^ http://svmlight.joachims.org/ 

^ http://people.cs.uchicago.edu/~vikass/svmlin.html

Table 1 Classification: Timing and accuracy (in brackets) comparison

Category       Kernel    RandSVM         LibSVM             SVMPerf        SVMLin
twonorm1       Gaussian  300 (94.98%)    8542 (96.48%)      X              X
twonorm2       Gaussian  437 (94.71%)    -                  X              X
ringnorm1      Gaussian  2637 (70.66%)   256 (70.31%)       X              X
ringnorm2      Gaussian  4982 (65.74%)   85124 (65.34%)     X              X
checkerboard1  Gaussian  406 (93.70%)    1568.93 (96.90%)   X              X
checkerboard2  Gaussian  814 (94.10%)    -                  X              X
CCAT           Linear    345 (94.37%)    X                  148 (94.38%)   429 (95.1913%)
C11            Linear    449 (96.57%)    X                  120 (97.53%)   295 (97.71%)

CCAT and C11. The data set was split into a training set of 700,000 documents and
a test set of 104,414 documents.

Table 1 shows the kernels which were used for each of the data sets. The parameters
used ($\sigma$ and $C$ for Gaussian kernels, and $C$ for linear kernels) were obtained by tuning
using grid search.



Selection of k for RandSVM: The values of $\epsilon$ and $\delta$ were fixed to 0.2 and 0.9
respectively, for all the data sets. For linearly separable data sets, $k$ was set to
$(16 \log(4n/\delta))/\epsilon^2$. For the others, $k$ was set to $(32 \log(4n/\delta))/\epsilon^2$.
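
To make the subset-size rule concrete, the following sketch evaluates it for the two training set sizes used here. It assumes the natural logarithm and rounds the result up to an integer; neither choice is stated in the text, and the function name is hypothetical.

```python
import math


def subset_size(n, eps=0.2, delta=0.9, separable=True):
    """k = c * log(4n/delta) / eps^2 with c = 16 (separable) or 32 (otherwise)."""
    c = 16 if separable else 32
    # Assumptions: natural logarithm, result rounded up to an integer.
    return math.ceil(c * math.log(4 * n / delta) / eps ** 2)


for n in (10**5, 10**6):
    print(n, subset_size(n), subset_size(n, separable=False))
```

Because $k$ grows only logarithmically in $n$, a tenfold increase in the training set size adds roughly $16 \ln(10)/\epsilon^2 \approx 921$ points to the subset in the separable case, under the natural-logarithm assumption.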



5.1.3 Discussion of results: 

Table 1 has the timing and classification accuracy comparisons. The subscripts 1 and
2 indicate that the corresponding training set sizes are $10^5$ and $10^6$ respectively. A
'-' indicates that the solver did not finish execution even after running for a day.
An 'X' indicates that the experiment is not applicable for the corresponding solver.
For the linear kernel, the solver used with RandSVM was SVMLight; otherwise it was
LibSVM.

The table shows that RandSVM can scale up SVM solvers for very large data sets. 
Using just a small wrapper around the solvers, RandSVM has scaled up SVMLight so 
that its performance is comparable to that of state of the art solvers such as SVMPerf 
and SVMLin. Similarly LibSVM has been made capable of quickly solving problems 
which it could not do before, even after executing for a day. In the case of the
ringnorm1 data set, the time taken by LibSVM is very small. Hence not much advantage is
gained by solving smaller sub-problems; this, combined with the overheads involved
in RandSVM, resulted in slower execution. Hence RandSVM may not always be
suited to small data sets.

It is clear, from the experiments on the synthetic data sets, that the execution
times taken by RandSVM for training with $10^5$ examples and $10^6$ examples are not
too far apart; this is a clear indication that the algorithm scales well with the increase
in the training set size.

All the runs of RandSVM except ringnorm1 terminated with the condition $|SV| < k$
being violated. Since the classification accuracies obtained by using RandSVM and
the baseline solvers are very close, it is clear that Theorem 3 holds in practice.

Table 2 Regression results († denotes RBF kernel)

           RandSVM                     LibSVM                         SVMLight
Data set   time (s)  MSE(ρ)            time (s)  MSE(ρ)               time (s)  MSE(ρ)
1.(a)†     42        2.3259 (0.9502)   5.61      1.8249 (0.922)       21.10     2.2897 (0.9509)
1.(b)      1489      1.3813 (0.9727)   913.6     2.9916 (0.9253)      4114.66   1.2173 (0.9753)
2.(a)      201       0.0319 (0.4625)   2650.3    0.0320 (0.4600)      336.64    0.0320 (0.4607)
                                       1645.8†   0.02621 (0.3502)†
2.(b)      327       68.24%            4459.8    68.49%               570.07    68.32%
3.         713       0.0320 (0.7894)   5671.2†   0.0315 (0.769755)†   460.36    0.0317 (0.7896)

5.2 Regression 

The experiments were done on 1 synthetic data set and 2 real world data sets: Forest
Cover [g and MNIST. RandSVM was compared with SVMLight and LibSVM [7].
Table 2 gives the execution time (in seconds), mean square error (MSE) and correlation
coefficient ($\rho$) for $\epsilon$-regression. A linear kernel is used unless specified. The value of $k$
is calculated according to fc = ^^-^^^^^jr^-. A value of $\epsilon' = 0.2$ and $\delta = 0.1$ is used. The
data sets are as follows:



5.2.1 Synthetic: 

The input attributes $(x_1, \ldots, x_{10})$ are generated independently, each of which is
distributed uniformly over [0, 1]. The target is defined by
$y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5) + 10 x_4 + 5 x_5 + N(0, 1)$. A value of $\epsilon = 1.0$ is chosen.
Two runs are done for training set sizes of (a) 10^ and (b) 10^ respectively.
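
A minimal sketch of this generator, using only the Python standard library, is given below. It implements the target exactly as printed above, so only the first five of the ten attributes influence $y$; the function names, the seed and the sample count are illustrative assumptions.

```python
import math
import random


def target(x):
    """Noise-free target as printed in the text (uses the first five attributes only)."""
    return (10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5)
            + 10 * x[3] + 5 * x[4])


def sample(rng, n_attrs=10):
    """Draw one (x, y) pair: x uniform on [0, 1)^n_attrs, y = target(x) + N(0, 1) noise."""
    x = [rng.random() for _ in range(n_attrs)]
    return x, target(x) + rng.gauss(0.0, 1.0)


rng = random.Random(0)
data = [sample(rng) for _ in range(1000)]
```

The remaining five attributes act as irrelevant inputs that the regressor has to learn to ignore.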



5.2.2 Forest Cover: 

There are 581012 records with label in {0, . . . , 6} and 54 features. The classification 
problem was transformed into a regression problem as follows: 

a) Predict the class labels with features scaled to [0, 1] and $\epsilon = 0.1$.

b) Predict +1 for examples of class 2 and -1 for examples of other classes. Since class
2 is over-represented, this leads to a more balanced problem. The features are scaled
to [0, 1] and a value of $\epsilon = 0.1$ is chosen.
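
The preprocessing in (a) and (b) can be sketched as follows. The min-max scaling rule, the handling of constant columns and the helper names are assumptions made for illustration; the paper only states that features are scaled to [0, 1] and that class 2 is mapped to +1.

```python
def min_max_scale(rows):
    """Scale each feature column to [0, 1]; constant columns are mapped to 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)]
            for row in rows]


def binary_target(label, positive_class=2):
    """Transformation (b): +1 for the chosen class, -1 for every other class."""
    return 1.0 if label == positive_class else -1.0


features = [[0, 10], [5, 10], [10, 10]]
labels = [2, 5, 2]
scaled = min_max_scale(features)
targets = [binary_target(c) for c in labels]
```

For transformation (a), the scaled features would be paired with the original class labels as regression targets instead of `targets`.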



5.2.3 MNIST: 

The data has 60000 training points and 10000 test points. There are 784 features, each
in {0, . . . , 255}, and 10 class labels {0, . . . , 9} which are used as the target for the
regression estimate. The features are scaled to [0, 1] and values of $\epsilon = 0.1$ and
$\delta = 0.9$ are used for regression.
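
The two quality measures reported in Table 2 can be computed as in the sketch below, assuming that $\rho$ denotes the Pearson correlation between the targets and the predictions; the paper does not spell out the exact definition, and the function names are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean square error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(vt * vp)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]  # constant offset: perfectly correlated, non-zero error
err, rho = mse(y_true, y_pred), pearson(y_true, y_pred)
```

The toy example shows why both numbers are reported: a constant offset leaves the correlation at its maximum while the MSE is still non-zero.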



http://yann.lecun.com/exdb/mnist/

6 Conclusions

A large number of learning problems can be viewed as instances of an abstract
optimization problem (AOP), which has an associated combinatorial dimension $\Delta$. An AOP
can be solved efficiently, with a high degree of accuracy, by selecting subsets of the size
of order of the combinatorial dimension of the problem. However, computing the
combinatorial dimension of an AOP is not a trivial task. In this paper, we have used ideas
from random projections to obtain estimates of the combinatorial dimension for SVM
formulations of classification and regression tasks with extremely promising results.